On the Convergence of Encoder-only Shallow Transformers

Neural Information Processing Systems

In addition, a neural tangent kernel (NTK) based analysis is given, which facilitates a comprehensive comparison. Our theory demonstrates a separation in the importance of different scaling schemes and initializations.